• The dataset I chose from Kaggle covers car prices in Oman in 2023. • It consists of 16,898 entries with 19 columns and gives a brief picture of market trends. • The columns describe the car brand, model, manufacturing year, color, regional specifications, transmission type, fuel type, mileage, paint type, condition, body condition, licensing, insurance, payment method, adjusted price, city, neighborhood, and exterior and interior options. • I perform data cleaning and exploratory data analysis on this dataset before building the model.
Importing libraries
if (!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
library("ggplot2")
if (!require("devtools")) install.packages("devtools")
## Loading required package: devtools
## Loading required package: usethis
library("devtools")
if (!require("tidyr")) install.packages("tidyr")
## Loading required package: tidyr
library("tidyr")
if (!require("dplyr")) install.packages("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("dplyr")
if (!require("gtsummary")) install.packages("gtsummary")
## Loading required package: gtsummary
library("gtsummary")
if (!require("skimr")) install.packages("skimr")
## Loading required package: skimr
library('skimr')
if (!require("broom")) install.packages("broom")
## Loading required package: broom
library("broom")
if (!require("GGally")) install.packages("GGally")
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(GGally)
if (!require("ggfortify")) install.packages("ggfortify")
## Loading required package: ggfortify
library(ggfortify)
Loading the Oman car prices 2023 data
df <- read.csv("oman_car_prices_2023.csv")
str(df)
## 'data.frame': 16898 obs. of 19 variables:
## $ Car.Make : chr "Mazda" "Toyota" "Nissan" "Nissan" ...
## $ Model : chr "3" "Camry" "Patrol" "Altima" ...
## $ Year : chr "2021" "2018" "2021" "2019" ...
## $ Regional.Specs : chr "American Specs" "American Specs" "GCC Specs" "American Specs" ...
## $ Transmission : chr "Automatic" "Automatic" "Automatic" "Automatic" ...
## $ Fuel : chr "Gasoline" "Gasoline" "Gasoline" "Gasoline" ...
## $ Color : chr "Nardo Grey" "White" "Grey" "White" ...
## $ Condition : chr "Used" "Used" "New" "Used" ...
## $ Kilometers : chr "50,000 - 59,999" "110,000 - 119,999" "0" "50,000 - 59,999" ...
## $ Paint : chr "Other" "Original Paint" "Original Paint" "Original Paint" ...
## $ Body.Condition : chr "Other" "Other" "Excellent with no defects" "Excellent with no defects" ...
## $ Car.License : chr "Licensed" "Licensed" "Not Licensed" "Licensed" ...
## $ Insurance : chr "Compulsory Insurance" "Compulsory Insurance" "Not Insured" "Compulsory Insurance" ...
## $ Payment.Method : chr "Cash Only" "Cash Only" "Cash or Installments" "Cash or Installments" ...
## $ City : chr "Muscat" "Muscat" "Muscat" "Al Batinah" ...
## $ Neighborhood : chr "Seeb" "Ansab" "Al Maabilah" "Saham" ...
## $ Exterior.Options: chr "Rim Wheels | Daytime Running Lights | LED Lights | Keyless Entry | Xenon Lights | Spare Tyre | Sunroof | Sports"| __truncated__ "Sunroof | Rear Sensors | Front Sensors" "LED Lights | Daytime Running Lights | Keyless Entry | Spare Tyre | Sunroof | Rear Sensors | Front Sensors | Bac"| __truncated__ "Keyless Entry | Electric Mirrors | Rim Wheels | LED Lights | Xenon Lights | Spare Tyre | Sports Package" ...
## $ Interior.Options: chr "Air Condition , ABS Brakes , Electric Windows , Cooled Seats , Heated Seats , Cruise Control , Sport Seats , Tyre Pressure Moni "360\xb0 Camera , Traction Control , Bluetooth , Android Auto , Media Screen , Touch Screen , AUX / USB Input , CD player , Navi "Air Condition , ABS Brakes , Electric Windows , Cooled Seats , Heated Seats , Cruise Control , Tyre Pressure Monitoring , Leath "Apple CarPlay , Airbags , Android Auto , Bluetooth , Electric Windows , Center Lock , Touch Screen , Leather Seats , Heated Sea ...
## $ adjusted_price : int 20661 18111 77570 15191 21461 17318 19949 21936 11369 22752 ...
Checking for nulls
is.na(df) %>% colSums()
## Car.Make Model Year Regional.Specs
## 0 0 0 0
## Transmission Fuel Color Condition
## 0 0 0 0
## Kilometers Paint Body.Condition Car.License
## 0 0 0 0
## Insurance Payment.Method City Neighborhood
## 0 0 0 0
## Exterior.Options Interior.Options adjusted_price
## 0 0 0
# Find the indices of rows where Year is 'Older than 1970'
rows_to_drop <- which(df$Year == 'Older than 1970')
# Drop the rows
df <- df[-rows_to_drop, ]
# Convert Year column to integer type
df$Year <- as.integer(df$Year)
# Count the number of rows where Year is greater than or equal to 2015
count <- sum(df$Year >= 2015)
# Subset the dataframe to keep only rows where Year is greater than or equal to 2015
df <- subset(df, Year >= 2015)
# Print the number of rows and columns in the filtered dataframe
cat(dim(df))
## 10115 19
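A quick base-R check of why the 'Older than 1970' rows must be dropped before the integer conversion: non-numeric strings coerce to NA (a toy sketch, not part of the original pipeline):

```r
# Non-numeric strings become NA when coerced, which is why rows like
# 'Older than 1970' are removed before converting Year to integer
y <- c("2021", "2018", "Older than 1970")
converted <- suppressWarnings(as.integer(y))
converted  # 2021 2018 NA
```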
df <- df %>%
select(-c("Car.License", "Neighborhood", "Exterior.Options", "Interior.Options"))
head(df)
## Car.Make Model Year Regional.Specs Transmission Fuel Color
## 1 Mazda 3 2021 American Specs Automatic Gasoline Nardo Grey
## 2 Toyota Camry 2018 American Specs Automatic Gasoline White
## 3 Nissan Patrol 2021 GCC Specs Automatic Gasoline Grey
## 4 Nissan Altima 2019 American Specs Automatic Gasoline White
## 5 Hyundai Genesis 2018 American Specs Automatic Gasoline White
## 6 Honda Accord 2018 American Specs Automatic Gasoline Silver
## Condition Kilometers Paint Body.Condition
## 1 Used 50,000 - 59,999 Other Other
## 2 Used 110,000 - 119,999 Original Paint Other
## 3 New 0 Original Paint Excellent with no defects
## 4 Used 50,000 - 59,999 Original Paint Excellent with no defects
## 5 Used 70,000 - 79,999 Other Other
## 6 Used 90,000 - 99,999 Other Other
## Insurance Payment.Method City adjusted_price
## 1 Compulsory Insurance Cash Only Muscat 20661
## 2 Compulsory Insurance Cash Only Muscat 18111
## 3 Not Insured Cash or Installments Muscat 77570
## 4 Compulsory Insurance Cash or Installments Al Batinah 15191
## 5 Compulsory Insurance Cash or Installments Al Dakhiliya 21461
## 6 Compulsory Insurance Cash or Installments Al Dakhiliya 17318
Filtering out cars priced below USD 5,000 to narrow the range of prices in the dataset; the complete data ranges from roughly USD 378 to USD 129k.
df <- subset(df, adjusted_price >= 5000)
Here, I clean the Kilometers and Condition variables. Some rows have Condition = New but a Kilometers range above 50K, which is not meaningful, so I recode those data points as Used.
# Treat the Kilometers ranges as a categorical variable for now
df$Kilometers <- as.factor(df$Kilometers)
# Update 'Condition' to "Used" where 'Condition' is "New" and 'Kilometers' is not "1 - 999" or "0"
df <- df %>%
mutate(Condition = ifelse(Condition == "New" & !(Kilometers %in% c("1 - 999", "0")), "Used", Condition))
# Print the updated data frame and its shape
print(dim(df))
## [1] 9989 15
Converted the Kilometers column from string ranges to integers by taking the mean of the lower and upper boundary of each range.
# Get lower and upper bounds of each range
library(stringr)
df$Kilometers <- gsub(",", "", df$Kilometers)
df <- df %>%
mutate(Lower = as.numeric(str_extract(Kilometers, "\\d+(?= - )")),
Upper = as.numeric(str_extract(Kilometers, "(?<=- )\\d+")))
# Calculate the mean of the ranges
df <- df %>%
mutate(Mean_Kilometers = ifelse(is.na(Upper), Lower, ceiling((Lower + Upper) / 2)))
df <- df %>%
select(-c("Kilometers","Lower", "Upper"))
df[is.na(df)] <- 0
# Display the result
head(df)
## Car.Make Model Year Regional.Specs Transmission Fuel Color
## 1 Mazda 3 2021 American Specs Automatic Gasoline Nardo Grey
## 2 Toyota Camry 2018 American Specs Automatic Gasoline White
## 3 Nissan Patrol 2021 GCC Specs Automatic Gasoline Grey
## 4 Nissan Altima 2019 American Specs Automatic Gasoline White
## 5 Hyundai Genesis 2018 American Specs Automatic Gasoline White
## 6 Honda Accord 2018 American Specs Automatic Gasoline Silver
## Condition Paint Body.Condition Insurance
## 1 Used Other Other Compulsory Insurance
## 2 Used Original Paint Other Compulsory Insurance
## 3 New Original Paint Excellent with no defects Not Insured
## 4 Used Original Paint Excellent with no defects Compulsory Insurance
## 5 Used Other Other Compulsory Insurance
## 6 Used Other Other Compulsory Insurance
## Payment.Method City adjusted_price Mean_Kilometers
## 1 Cash Only Muscat 20661 55000
## 2 Cash Only Muscat 18111 115000
## 3 Cash or Installments Muscat 77570 0
## 4 Cash or Installments Al Batinah 15191 55000
## 5 Cash or Installments Al Dakhiliya 21461 75000
## 6 Cash or Installments Al Dakhiliya 17318 95000
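The range-to-midpoint logic can be sanity-checked on a toy vector (a minimal sketch mirroring the pipeline above):

```r
library(stringr)

km <- c("50,000 - 59,999", "0", "110,000 - 119,999")
km <- gsub(",", "", km)                              # strip thousands separators
lower <- as.numeric(str_extract(km, "\\d+(?= - )"))  # lower bound of each range
upper <- as.numeric(str_extract(km, "(?<=- )\\d+"))  # upper bound of each range
mid <- ifelse(is.na(upper), lower, ceiling((lower + upper) / 2))
mid[is.na(mid)] <- 0   # single values like "0" have no range, so they fall through as NA
mid  # 55000 0 115000
```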
Data Prep: created a new feature ("Luxury_Car") to flag luxury car brands.
premium_car_brands <- c(
'Lexus', 'Mercedes Benz', 'BMW', 'Audi', 'Jaguar', 'Porsche',
'Infiniti', 'Maserati', 'Bentley', 'Alfa Romeo',"Tesla", 'Hongqi', 'Haval', 'Maxus', 'GAC', 'GMC',
'Rolls Royce', 'Land Rover', 'Cadillac', 'Ferrari', 'Lamborghini', 'Changan', "Volvo"
)
# Create a new column 'Luxury_Car' (1 for luxury car, 0 for non-luxury car)
df <- df %>%
mutate(Luxury_Car = ifelse(Car.Make %in% premium_car_brands, 1, 0))
# Calculate fleet size for each car make
fleet <- df %>%
group_by(Car.Make) %>%
summarize(Fleet_Size = n())
# Merge fleet size back to the original data frame
df <- df %>%
left_join(fleet, by = 'Car.Make')
df <- df %>%
select(-c("Car.Make", "Model"))
head(df)
## Year Regional.Specs Transmission Fuel Color Condition Paint
## 1 2021 American Specs Automatic Gasoline Nardo Grey Used Other
## 2 2018 American Specs Automatic Gasoline White Used Original Paint
## 3 2021 GCC Specs Automatic Gasoline Grey New Original Paint
## 4 2019 American Specs Automatic Gasoline White Used Original Paint
## 5 2018 American Specs Automatic Gasoline White Used Other
## 6 2018 American Specs Automatic Gasoline Silver Used Other
## Body.Condition Insurance Payment.Method
## 1 Other Compulsory Insurance Cash Only
## 2 Other Compulsory Insurance Cash Only
## 3 Excellent with no defects Not Insured Cash or Installments
## 4 Excellent with no defects Compulsory Insurance Cash or Installments
## 5 Other Compulsory Insurance Cash or Installments
## 6 Other Compulsory Insurance Cash or Installments
## City adjusted_price Mean_Kilometers Luxury_Car Fleet_Size
## 1 Muscat 20661 55000 0 88
## 2 Muscat 18111 115000 0 3389
## 3 Muscat 77570 0 0 1461
## 4 Al Batinah 15191 55000 0 1461
## 5 Al Dakhiliya 21461 75000 0 649
## 6 Al Dakhiliya 17318 95000 0 496
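The group-size join above (group_by/summarize/left_join) could also be written with dplyr::add_count; a sketch on toy data, which should behave the same way:

```r
library(dplyr)

toy <- data.frame(Car.Make = c("Toyota", "Toyota", "Mazda"))
# One call attaches the per-make row count as a new column
toy <- toy %>% add_count(Car.Make, name = "Fleet_Size")
toy$Fleet_Size  # 2 2 1
```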
Removed outliers from the target variable using the 1.5×IQR rule.
# Calculate quartiles and IQR
# Create the boxplot
ggplot(df, aes(x = "", y = adjusted_price)) +
geom_boxplot() +
labs(y = "Adjusted Price") +
theme_minimal() +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
q1 <- quantile(df$adjusted_price, 0.25)
q3 <- quantile(df$adjusted_price, 0.75)
IQR <- q3 - q1
# Calculate lower and upper bounds for outliers
min_ <- q1 - (1.5 * IQR)
max_ <- q3 + (1.5 * IQR)
# Remove outliers
new_df <- df[df$adjusted_price > min_ & df$adjusted_price < max_, ]
print(dim(new_df))
## [1] 9431 15
summary(new_df$adjusted_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5000 10784 15145 16801 20684 39996
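The 1.5×IQR fences used above behave as follows on a toy vector (a minimal sketch):

```r
x <- c(1, 2, 3, 4, 100)
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
IQR_ <- q3 - q1
min_ <- q1 - 1.5 * IQR_  # lower fence
max_ <- q3 + 1.5 * IQR_  # upper fence
x[x > min_ & x < max_]   # 1 2 3 4 -- the outlier 100 is removed
```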
Mapped the car Body.Condition categories to an ordinal 0–4 scale
# Define mapping dictionary
replace_mapping <- c(
"Other",
"Poor (severe body damages)",
"Fair (body needs work)",
"Good (body only has minor blemishes)",
"Excellent with no defects"
)
new_df$Body.Condition <- factor(new_df$Body.Condition, levels = replace_mapping, labels = c(0, 1, 2, 3, 4))
head(new_df)
## Year Regional.Specs Transmission Fuel Color Condition Paint
## 1 2021 American Specs Automatic Gasoline Nardo Grey Used Other
## 2 2018 American Specs Automatic Gasoline White Used Original Paint
## 4 2019 American Specs Automatic Gasoline White Used Original Paint
## 5 2018 American Specs Automatic Gasoline White Used Other
## 6 2018 American Specs Automatic Gasoline Silver Used Other
## 7 2018 American Specs Automatic Gasoline Burgundy Used Total repaint
## Body.Condition Insurance Payment.Method City
## 1 0 Compulsory Insurance Cash Only Muscat
## 2 0 Compulsory Insurance Cash Only Muscat
## 4 4 Compulsory Insurance Cash or Installments Al Batinah
## 5 0 Compulsory Insurance Cash or Installments Al Dakhiliya
## 6 0 Compulsory Insurance Cash or Installments Al Dakhiliya
## 7 2 Compulsory Insurance Cash Only Al Batinah
## adjusted_price Mean_Kilometers Luxury_Car Fleet_Size
## 1 20661 55000 0 88
## 2 18111 115000 0 3389
## 4 15191 55000 0 1461
## 5 21461 75000 0 649
## 6 17318 95000 0 496
## 7 19949 75000 0 3389
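Note that factor(..., labels = c(0, 1, 2, 3, 4)) produces a factor whose levels merely look numeric; lm() will treat Body.Condition as categorical (hence the separate Body.Condition1–4 coefficients in the regression output later). A toy sketch of the conversion back to numeric, should an ordinal treatment be wanted instead:

```r
f <- factor(c("Other", "Excellent with no defects"),
            levels = c("Other", "Poor (severe body damages)",
                       "Fair (body needs work)",
                       "Good (body only has minor blemishes)",
                       "Excellent with no defects"),
            labels = c(0, 1, 2, 3, 4))
as.numeric(f)                # 1 5 -- underlying level codes, NOT the labels
as.numeric(as.character(f))  # 0 4 -- the intended ordinal values
```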
Created a box plot of price by fuel type. Fuel_Electric shows a wide range of prices, so only selected categories are usable during analysis.
ggplot(new_df, aes(x = Fuel, y = adjusted_price)) +
geom_boxplot(fill="#1b98e0") +
labs(title = "Price Distribution by Fuel Type")
Created a box plot of price by regional specs. GCC Specs shows a wide range of prices, so only selected categories are usable during analysis.
ggplot(new_df, aes(x = Regional.Specs, y = adjusted_price)) +
geom_boxplot(fill="#1b98e0") +
labs(title = "Price Distribution by Regional Specs")
Created a box plot of price by transmission. Manual transmission shows a wide range of prices, so only selected categories are usable during analysis.
ggplot(new_df, aes(x = Transmission, y = adjusted_price)) +
geom_boxplot(fill="#1b98e0") +
labs(title = "Price Distribution by Transmission")
Created a box plot of price by body condition. Body condition 4 shows a wide range of prices, so only selected categories are usable during analysis.
ggplot(new_df, aes(x = Body.Condition, y = adjusted_price)) +
geom_boxplot(fill="#1b98e0") +
labs(title = "Price Distribution by Body Condition")
Created a box plot of price by paint type. Original Paint and Total Repaint show wide ranges of prices, so only selected categories are usable during analysis.
ggplot(new_df, aes(x = Paint, y = adjusted_price)) +
geom_boxplot(fill="#1b98e0") +
labs(title = "Price Distribution by Paint")
Created a box plot of price by the Luxury_Car flag. The two categories show clearly different price ranges.
ggplot(new_df, aes(x = as.factor(Luxury_Car), y = adjusted_price)) +
geom_boxplot(fill="#1b98e0") +
labs(title = "Distribution of Luxury Car")
Avg. Price change over Years
new_df %>%
group_by(Year) %>%
summarize(mean_adjusted_price = mean(adjusted_price)) %>%
ggplot(aes(x = Year, y = mean_adjusted_price)) +
geom_line() +
labs(title = "Mean Adjusted Price by Year")
Price vs. Year
# Scatter plot
ggplot(new_df, aes(x = Year, y = adjusted_price)) +
geom_point() +
labs(x = "Year", y = "Adjusted Price") +
ggtitle("Scatter Plot of Year vs Adjusted Price")
Mean_Kilometers vs Price
# Scatter plot
ggplot(new_df, aes(x = Mean_Kilometers, y = adjusted_price)) +
geom_point() +
labs(x = "Mean Kilometers", y = "Adjusted Price") +
ggtitle("Scatter Plot of Mean Kilometers vs Adjusted Price")
Fleet_Size vs Price
# Scatter plot
ggplot(new_df, aes(x = Fleet_Size, y = adjusted_price)) +
geom_point() +
labs(x = "Fleet Size", y = "Adjusted Price") +
ggtitle("Scatter Plot of Fleet Size vs Adjusted Price")
Histogram on Price
# Create the histogram
ggplot(new_df, aes(x = adjusted_price)) +
geom_histogram(bins = 20, fill = "#1b98e0", color = "black") +
labs(x = "adjusted_price", y = "Frequency", title = "Histogram of adjusted_price")
Used a scatterplot matrix to examine the relationships among the numeric variables
ggpairs(dplyr::select(new_df, "adjusted_price", "Year","Mean_Kilometers", "Fleet_Size")) +
ggtitle("Scatterplot matrix on numeric columns")
Removed the color attribute from the data
new_df <- new_df %>%
select(-c("Color"))
head(new_df)
## Year Regional.Specs Transmission Fuel Condition Paint
## 1 2021 American Specs Automatic Gasoline Used Other
## 2 2018 American Specs Automatic Gasoline Used Original Paint
## 4 2019 American Specs Automatic Gasoline Used Original Paint
## 5 2018 American Specs Automatic Gasoline Used Other
## 6 2018 American Specs Automatic Gasoline Used Other
## 7 2018 American Specs Automatic Gasoline Used Total repaint
## Body.Condition Insurance Payment.Method City
## 1 0 Compulsory Insurance Cash Only Muscat
## 2 0 Compulsory Insurance Cash Only Muscat
## 4 4 Compulsory Insurance Cash or Installments Al Batinah
## 5 0 Compulsory Insurance Cash or Installments Al Dakhiliya
## 6 0 Compulsory Insurance Cash or Installments Al Dakhiliya
## 7 2 Compulsory Insurance Cash Only Al Batinah
## adjusted_price Mean_Kilometers Luxury_Car Fleet_Size
## 1 20661 55000 0 88
## 2 18111 115000 0 3389
## 4 15191 55000 0 1461
## 5 21461 75000 0 649
## 6 17318 95000 0 496
## 7 19949 75000 0 3389
Converted the categorical variables into dummy variables
# Specify categorical columns
cat_cols <- c("Regional.Specs", "Transmission", "Fuel", "Paint", "Condition", "Fleet_Size",
"City", "Insurance", "Payment.Method")
# Create dummy variables
dummies <- model.matrix(~ . - 1, data = new_df[cat_cols])
# Convert the resulting matrix to a data frame
d_df <- as.data.frame(dummies)
head(d_df)
## Regional.SpecsAmerican Specs Regional.SpecsEuropean Specs
## 1 1 0
## 2 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## Regional.SpecsGCC Specs Regional.SpecsJapanese Specs
## 1 0 0
## 2 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## Regional.SpecsOther Specs TransmissionManual FuelElectric FuelGasoline
## 1 0 0 0 1
## 2 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 0 0 0 1
## 7 0 0 0 1
## FuelHybrid FuelMild Hybrid FuelPlug-in - Hybrid PaintOther
## 1 0 0 0 1
## 2 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 1
## 6 0 0 0 1
## 7 0 0 0 0
## PaintPartially repainted PaintTotal repaint ConditionUsed Fleet_Size
## 1 0 0 1 88
## 2 0 0 1 3389
## 4 0 0 1 1461
## 5 0 0 1 649
## 6 0 0 1 496
## 7 0 1 1 3389
## CityAl Dakhiliya CityAl Dhahirah CityAl Sharqiya CityAl Wustaa CityBuraimi
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 4 0 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## 7 0 0 0 0 0
## CityDhofar CityMusandam CityMuscat InsuranceCompulsory Insurance
## 1 0 0 1 1
## 2 0 0 1 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 0 0 0 1
## 7 0 0 0 1
## InsuranceNot Insured Payment.MethodCash or Installments
## 1 0 0
## 2 0 0
## 4 0 1
## 5 0 1
## 6 0 1
## 7 0 0
## Payment.MethodInstallments Only
## 1 0
## 2 0
## 4 0
## 5 0
## 6 0
## 7 0
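One subtlety of model.matrix(~ . - 1, ...) visible in the dummy output above: only the first factor keeps all of its levels; later factors are treatment-coded with one level dropped (which is why Regional.Specs has five dummies but Transmission only has Manual). A toy sketch:

```r
d <- data.frame(a = factor(c("x", "y")), b = factor(c("p", "q")))
m <- model.matrix(~ . - 1, data = d)
colnames(m)  # "ax" "ay" "bq" -- 'a' keeps both levels, 'b' drops its first
```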
Combined the original data frame (minus the categorical columns) with the dummy variables
# Combine original dataframe without categorical columns and dummy variables
processed_df <- cbind(
new_df[, !(names(new_df) %in% cat_cols)],
d_df
)
# Print the processed dataframe
names(processed_df)
## [1] "Year" "Body.Condition"
## [3] "adjusted_price" "Mean_Kilometers"
## [5] "Luxury_Car" "Regional.SpecsAmerican Specs"
## [7] "Regional.SpecsEuropean Specs" "Regional.SpecsGCC Specs"
## [9] "Regional.SpecsJapanese Specs" "Regional.SpecsOther Specs"
## [11] "TransmissionManual" "FuelElectric"
## [13] "FuelGasoline" "FuelHybrid"
## [15] "FuelMild Hybrid" "FuelPlug-in - Hybrid"
## [17] "PaintOther" "PaintPartially repainted"
## [19] "PaintTotal repaint" "ConditionUsed"
## [21] "Fleet_Size" "CityAl Dakhiliya"
## [23] "CityAl Dhahirah" "CityAl Sharqiya"
## [25] "CityAl Wustaa" "CityBuraimi"
## [27] "CityDhofar" "CityMusandam"
## [29] "CityMuscat" "InsuranceCompulsory Insurance"
## [31] "InsuranceNot Insured" "Payment.MethodCash or Installments"
## [33] "Payment.MethodInstallments Only"
# Rename the columns with no spaces
colnames(processed_df) <- c("Year", "Body.Condition", "adjusted_price", "Mean_Kilometers", "Luxury_Car",
"Regional.Specs_AmericanSpecs", "Regional.Specs_EuropeanSpecs", "Regional.Specs_GCCSpecs",
"Regional.Specs_JapaneseSpecs", "Regional.Specs_OtherSpecs", "Transmission_Manual",
"Fuel_Electric", "Fuel_Gasoline", "Fuel_Hybrid", "Fuel_MildHybrid", "Fuel_Plug-inHybrid",
"Paint_Other", "Paint_Partiallyrepainted", "Paint_Totalrepaint", "Condition_Used",
"Fleet_Size", "City_AlDakhiliya", "City_AlDhahirah", "City_AlSharqiya", "City_AlWustaa", "City_Buraimi",
"City_Dhofar", "City_Musandam", "City_Muscat", "Insurance_CompulsoryInsurance",
"Insurance_NotInsured", "PaymentMethod_CashorInstallments", "PaymentMethod_InstallmentsOnly")
#
# Print the column names after renaming
(colnames(processed_df))
## [1] "Year" "Body.Condition"
## [3] "adjusted_price" "Mean_Kilometers"
## [5] "Luxury_Car" "Regional.Specs_AmericanSpecs"
## [7] "Regional.Specs_EuropeanSpecs" "Regional.Specs_GCCSpecs"
## [9] "Regional.Specs_JapaneseSpecs" "Regional.Specs_OtherSpecs"
## [11] "Transmission_Manual" "Fuel_Electric"
## [13] "Fuel_Gasoline" "Fuel_Hybrid"
## [15] "Fuel_MildHybrid" "Fuel_Plug-inHybrid"
## [17] "Paint_Other" "Paint_Partiallyrepainted"
## [19] "Paint_Totalrepaint" "Condition_Used"
## [21] "Fleet_Size" "City_AlDakhiliya"
## [23] "City_AlDhahirah" "City_AlSharqiya"
## [25] "City_AlWustaa" "City_Buraimi"
## [27] "City_Dhofar" "City_Musandam"
## [29] "City_Muscat" "Insurance_CompulsoryInsurance"
## [31] "Insurance_NotInsured" "PaymentMethod_CashorInstallments"
## [33] "PaymentMethod_InstallmentsOnly"
Selected the final feature set for modeling
# Define columns to keep
cols_to_keep <- c('Year', 'adjusted_price', 'Mean_Kilometers',"Regional.Specs_AmericanSpecs","Regional.Specs_JapaneseSpecs", "Regional.Specs_GCCSpecs",
"Transmission_Manual", "Fuel_Gasoline", "Fuel_Hybrid", "Fuel_Electric",
"Paint_Totalrepaint", "Body.Condition",
"Luxury_Car", "Fleet_Size",
'PaymentMethod_CashorInstallments', 'PaymentMethod_InstallmentsOnly',
'City_Musandam', 'City_Muscat','City_AlWustaa',
"Insurance_CompulsoryInsurance", 'Insurance_NotInsured')
#
# Subset the dataframe
processed_df_subset <- processed_df[, cols_to_keep]
# Print the subsetted dataframe
print(dim(processed_df_subset))
## [1] 9431 21
head((processed_df_subset))
## Year adjusted_price Mean_Kilometers Regional.Specs_AmericanSpecs
## 1 2021 20661 55000 1
## 2 2018 18111 115000 1
## 4 2019 15191 55000 1
## 5 2018 21461 75000 1
## 6 2018 17318 95000 1
## 7 2018 19949 75000 1
## Regional.Specs_JapaneseSpecs Regional.Specs_GCCSpecs Transmission_Manual
## 1 0 0 0
## 2 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## Fuel_Gasoline Fuel_Hybrid Fuel_Electric Paint_Totalrepaint Body.Condition
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 4 1 0 0 0 4
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## 7 1 0 0 1 2
## Luxury_Car Fleet_Size PaymentMethod_CashorInstallments
## 1 0 88 0
## 2 0 3389 0
## 4 0 1461 1
## 5 0 649 1
## 6 0 496 1
## 7 0 3389 0
## PaymentMethod_InstallmentsOnly City_Musandam City_Muscat City_AlWustaa
## 1 0 0 1 0
## 2 0 0 1 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## Insurance_CompulsoryInsurance Insurance_NotInsured
## 1 1 0
## 2 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
dim(processed_df_subset)
## [1] 9431 21
Split the data into 60% training, 20% validation, and 20% test sets
tvid<-sample(nrow(processed_df_subset), ceiling(nrow(processed_df_subset)*0.6))
mytrain<-processed_df_subset[tvid,]
mytest_val<-processed_df_subset[-tvid,]
vid<-sample(nrow(mytest_val), ceiling(nrow(mytest_val)*0.5))
mytest<-mytest_val[vid,]
myval<-mytest_val[-vid,]
head(mytrain)
## Year adjusted_price Mean_Kilometers Regional.Specs_AmericanSpecs
## 9859 2016 15850 55000 0
## 1803 2016 11308 85000 1
## 4828 2017 9671 95000 1
## 4876 2015 8767 0 1
## 7488 2019 35033 55000 0
## 3599 2020 10811 55000 1
## Regional.Specs_JapaneseSpecs Regional.Specs_GCCSpecs Transmission_Manual
## 9859 0 0 0
## 1803 0 0 0
## 4828 0 0 0
## 4876 0 0 0
## 7488 0 1 0
## 3599 0 0 0
## Fuel_Gasoline Fuel_Hybrid Fuel_Electric Paint_Totalrepaint Body.Condition
## 9859 1 0 0 0 0
## 1803 1 0 0 0 0
## 4828 1 0 0 0 4
## 4876 1 0 0 0 4
## 7488 1 0 0 0 4
## 3599 1 0 0 0 2
## Luxury_Car Fleet_Size PaymentMethod_CashorInstallments
## 9859 1 844 1
## 1803 0 496 0
## 4828 0 1461 1
## 4876 0 3389 0
## 7488 0 49 0
## 3599 0 286 0
## PaymentMethod_InstallmentsOnly City_Musandam City_Muscat City_AlWustaa
## 9859 0 0 0 0
## 1803 0 0 0 0
## 4828 0 0 0 0
## 4876 0 0 1 0
## 7488 0 0 1 0
## 3599 0 0 1 0
## Insurance_CompulsoryInsurance Insurance_NotInsured
## 9859 0 1
## 1803 1 0
## 4828 1 0
## 4876 1 0
## 7488 0 0
## 3599 1 0
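Since sample() is random, the split above changes from run to run; setting a seed pins it down (a sketch on toy indices; the seed value is an arbitrary choice and was not used for the outputs shown in this report):

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 100
train_id <- sample(n, ceiling(n * 0.6))          # 60% training
rest <- setdiff(seq_len(n), train_id)
val_id <- sample(rest, ceiling(length(rest) * 0.5))  # half of the remainder
test_id <- setdiff(rest, val_id)
c(length(train_id), length(val_id), length(test_id))  # 60 20 20
```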
After data pre-processing and EDA, I moved on to modeling with statistical learning techniques. Because the dataset contains many categorical variables, I approached it with both linear regression and non-parametric machine learning models. I built 5 models in total and used the Mean Squared Error (MSE) metric to compare them in this analysis.
Models created: 1. Multiple Linear Regression (All Variables) 2. Multiple Linear Regression (Forward Stepwise) 3. Decision Tree Regressor 4. Artificial Neural Network 5. KNN
I started with a linear regression model using all variables; the fitted model is shown below.
R-squared achieved - 0.4205 (42%) Train MSE - 34724305.73 Test MSE - 34553256.31
lm.allvars.fit1 = lm( adjusted_price ~ Year +Mean_Kilometers+Regional.Specs_AmericanSpecs+Regional.Specs_JapaneseSpecs+Regional.Specs_GCCSpecs+Transmission_Manual+Fuel_Gasoline+Fuel_Hybrid+Fuel_Electric+Paint_Totalrepaint+Body.Condition+Luxury_Car+Fleet_Size+PaymentMethod_CashorInstallments+PaymentMethod_InstallmentsOnly+City_Musandam+City_AlWustaa+City_Muscat+Insurance_CompulsoryInsurance+Insurance_NotInsured, data= mytrain)
print(summary(lm.allvars.fit1))
##
## Call:
## lm(formula = adjusted_price ~ Year + Mean_Kilometers + Regional.Specs_AmericanSpecs +
## Regional.Specs_JapaneseSpecs + Regional.Specs_GCCSpecs +
## Transmission_Manual + Fuel_Gasoline + Fuel_Hybrid + Fuel_Electric +
## Paint_Totalrepaint + Body.Condition + Luxury_Car + Fleet_Size +
## PaymentMethod_CashorInstallments + PaymentMethod_InstallmentsOnly +
## City_Musandam + City_AlWustaa + City_Muscat + Insurance_CompulsoryInsurance +
## Insurance_NotInsured, data = mytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26714 -3953 -1119 2825 26947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.187e+06 9.283e+04 -34.334 < 2e-16 ***
## Year 1.588e+03 4.601e+01 34.515 < 2e-16 ***
## Mean_Kilometers -1.696e-03 1.412e-03 -1.201 0.229739
## Regional.Specs_AmericanSpecs 3.034e+02 2.587e+02 1.172 0.241090
## Regional.Specs_JapaneseSpecs -3.630e+02 6.130e+02 -0.592 0.553792
## Regional.Specs_GCCSpecs 1.749e+03 3.221e+02 5.429 5.91e-08 ***
## Transmission_Manual 2.619e+03 4.567e+02 5.735 1.03e-08 ***
## Fuel_Gasoline -2.713e+03 9.178e+02 -2.956 0.003125 **
## Fuel_Hybrid -2.743e+03 1.357e+03 -2.022 0.043249 *
## Fuel_Electric -6.012e+03 2.179e+03 -2.759 0.005820 **
## Paint_Totalrepaint -7.265e+02 4.711e+02 -1.542 0.123092
## Body.Condition1 -1.328e+03 6.224e+02 -2.133 0.032961 *
## Body.Condition2 -6.097e+02 3.206e+02 -1.902 0.057253 .
## Body.Condition3 -5.811e+00 3.191e+02 -0.018 0.985473
## Body.Condition4 2.035e+02 1.948e+02 1.045 0.296008
## Luxury_Car 1.057e+04 2.179e+02 48.509 < 2e-16 ***
## Fleet_Size 3.780e-01 6.524e-02 5.793 7.27e-09 ***
## PaymentMethod_CashorInstallments 1.514e+03 1.644e+02 9.206 < 2e-16 ***
## PaymentMethod_InstallmentsOnly -9.706e+02 1.271e+03 -0.764 0.445121
## City_Musandam -2.868e+03 4.184e+03 -0.686 0.493048
## City_AlWustaa 5.713e+02 3.420e+03 0.167 0.867348
## City_Muscat 6.145e+02 1.654e+02 3.716 0.000204 ***
## Insurance_CompulsoryInsurance -9.885e+02 2.480e+02 -3.986 6.79e-05 ***
## Insurance_NotInsured -3.141e+03 3.942e+02 -7.968 1.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5905 on 5635 degrees of freedom
## Multiple R-squared: 0.4205, Adjusted R-squared: 0.4182
## F-statistic: 177.8 on 23 and 5635 DF, p-value: < 2.2e-16
lm.train.preds <- (predict(lm.allvars.fit1, newdata = mytrain))
lm.fw.train.MSE <- mean((lm.train.preds-mytrain$adjusted_price)^2)
print(paste("Train MSE", lm.fw.train.MSE))
## [1] "Train MSE 34724305.7306432"
lm.test.preds <- (predict(lm.allvars.fit1, newdata = mytest))
lm.fw.test.MSE <- mean((lm.test.preds-mytest$adjusted_price)^2)
print(paste("TEST MSE", lm.fw.test.MSE))
## [1] "TEST MSE 34553256.3125353"
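The train and test MSE computations repeat the same expression; a small helper keeps them consistent (a sketch; `mse` is a name introduced here, not from the original analysis):

```r
# Mean squared error between actual and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse(c(1, 2, 3), c(1, 2, 5))  # 4/3, i.e. about 1.333
```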
autoplot(lm.allvars.fit1, which=1, nrow=1, ncol= 1)
autoplot(lm.allvars.fit1, which=2, nrow=1, ncol= 1)
autoplot(lm.allvars.fit1, which=3, nrow=1, ncol= 1)
Linear regression model with transformed Year and Mean_Kilometers
From the residual plots, the mean residual has moved closer to zero, but the Q-Q plot still shows poor results.
lm.allvars.fit = lm( adjusted_price ~ I( log(Year) + Year^2 + Mean_Kilometers^2) +Regional.Specs_AmericanSpecs+Regional.Specs_JapaneseSpecs+Regional.Specs_GCCSpecs+Transmission_Manual+Fuel_Gasoline+Fuel_Hybrid+Fuel_Electric+Paint_Totalrepaint+Body.Condition+Luxury_Car+Fleet_Size+PaymentMethod_CashorInstallments+PaymentMethod_InstallmentsOnly+City_Musandam+City_AlWustaa+City_Muscat+Insurance_CompulsoryInsurance+Insurance_NotInsured, data= mytrain)
print(summary(lm.allvars.fit))
##
## Call:
## lm(formula = adjusted_price ~ I(log(Year) + Year^2 + Mean_Kilometers^2) +
## Regional.Specs_AmericanSpecs + Regional.Specs_JapaneseSpecs +
## Regional.Specs_GCCSpecs + Transmission_Manual + Fuel_Gasoline +
## Fuel_Hybrid + Fuel_Electric + Paint_Totalrepaint + Body.Condition +
## Luxury_Car + Fleet_Size + PaymentMethod_CashorInstallments +
## PaymentMethod_InstallmentsOnly + City_Musandam + City_AlWustaa +
## City_Muscat + Insurance_CompulsoryInsurance + Insurance_NotInsured,
## data = mytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19887 -4549 -1364 3186 26044
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.652e+04 1.067e+03 15.484
## I(log(Year) + Year^2 + Mean_Kilometers^2) -9.309e-08 9.369e-09 -9.936
## Regional.Specs_AmericanSpecs 4.856e+02 2.831e+02 1.716
## Regional.Specs_JapaneseSpecs -3.710e+01 6.717e+02 -0.055
## Regional.Specs_GCCSpecs 1.777e+03 3.532e+02 5.030
## Transmission_Manual 2.149e+03 5.002e+02 4.295
## Fuel_Gasoline -2.651e+03 1.006e+03 -2.636
## Fuel_Hybrid -2.911e+03 1.487e+03 -1.958
## Fuel_Electric -4.939e+03 2.389e+03 -2.068
## Paint_Totalrepaint -2.023e+03 5.146e+02 -3.931
## Body.Condition1 -1.230e+03 6.822e+02 -1.802
## Body.Condition2 -3.074e+02 3.511e+02 -0.876
## Body.Condition3 1.443e+02 3.496e+02 0.413
## Body.Condition4 2.216e+02 2.134e+02 1.039
## Luxury_Car 9.874e+03 2.379e+02 41.506
## Fleet_Size 5.013e-01 7.134e-02 7.027
## PaymentMethod_CashorInstallments 2.296e+03 1.784e+02 12.868
## PaymentMethod_InstallmentsOnly -4.575e+02 1.393e+03 -0.328
## City_Musandam -2.072e+03 4.586e+03 -0.452
## City_AlWustaa -5.801e+02 3.749e+03 -0.155
## City_Muscat 1.271e+03 1.798e+02 7.067
## Insurance_CompulsoryInsurance -1.665e+03 2.709e+02 -6.144
## Insurance_NotInsured -2.840e+03 4.320e+02 -6.575
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## I(log(Year) + Year^2 + Mean_Kilometers^2) < 2e-16 ***
## Regional.Specs_AmericanSpecs 0.08630 .
## Regional.Specs_JapaneseSpecs 0.95596
## Regional.Specs_GCCSpecs 5.06e-07 ***
## Transmission_Manual 1.77e-05 ***
## Fuel_Gasoline 0.00842 **
## Fuel_Hybrid 0.05033 .
## Fuel_Electric 0.03870 *
## Paint_Totalrepaint 8.58e-05 ***
## Body.Condition1 0.07155 .
## Body.Condition2 0.38132
## Body.Condition3 0.67984
## Body.Condition4 0.29893
## Luxury_Car < 2e-16 ***
## Fleet_Size 2.37e-12 ***
## PaymentMethod_CashorInstallments < 2e-16 ***
## PaymentMethod_InstallmentsOnly 0.74262
## City_Musandam 0.65142
## City_AlWustaa 0.87703
## City_Muscat 1.78e-12 ***
## Insurance_CompulsoryInsurance 8.61e-10 ***
## Insurance_NotInsured 5.30e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6473 on 5636 degrees of freedom
## Multiple R-squared: 0.3036, Adjusted R-squared: 0.3009
## F-statistic: 111.7 on 22 and 5636 DF, p-value: < 2.2e-16
autoplot(lm.allvars.fit, which=1, nrow=1, ncol= 1)
autoplot(lm.allvars.fit, which=2, nrow=1, ncol= 1)
autoplot(lm.allvars.fit, which=3, nrow=1, ncol= 1)
lm.train.preds <- (predict(lm.allvars.fit, newdata = mytrain))
lm.fw.train.MSE <- mean((lm.train.preds-mytrain$adjusted_price)^2)
print(paste("Train MSE", lm.fw.train.MSE))
## [1] "Train MSE 41727937.3187412"
lm.test.preds <- (predict(lm.allvars.fit, newdata = mytest))
lm.fw.test.MSE <- mean((lm.test.preds-mytest$adjusted_price)^2)
print(paste("TEST MSE", lm.fw.test.MSE))
## [1] "TEST MSE 43110836.9821421"
Next, I model a forward stepwise multiple linear regression. Since the previous model has many explanatory variables, I wanted to reduce the feature set without degrading the fit to the response, so I run the forward stepwise selection below.
R-squared: 0.43 (43%); Train MSE: 33469287.547; Test MSE: 36907041.534
fullmodel<-lm(adjusted_price~.,data= mytrain)
my_intercept<-lm(adjusted_price~1, data=mytrain)
lm.forward.fit<-step(my_intercept, direction='forward', scope=formula(fullmodel), trace=0)
summary(lm.forward.fit)
##
## Call:
## lm(formula = adjusted_price ~ Luxury_Car + Year + Regional.Specs_GCCSpecs +
## PaymentMethod_CashorInstallments + Insurance_NotInsured +
## Transmission_Manual + Fleet_Size + Insurance_CompulsoryInsurance +
## City_Muscat + Body.Condition + Fuel_Electric + Fuel_Gasoline +
## Fuel_Hybrid + Paint_Totalrepaint, data = mytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26584 -3979 -1140 2872 27124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.207e+06 9.105e+04 -35.223 < 2e-16 ***
## Luxury_Car 1.056e+04 2.174e+02 48.566 < 2e-16 ***
## Year 1.598e+03 4.513e+01 35.405 < 2e-16 ***
## Regional.Specs_GCCSpecs 1.476e+03 2.218e+02 6.652 3.17e-11 ***
## PaymentMethod_CashorInstallments 1.517e+03 1.639e+02 9.256 < 2e-16 ***
## Insurance_NotInsured -3.118e+03 3.933e+02 -7.927 2.69e-15 ***
## Transmission_Manual 2.619e+03 4.546e+02 5.762 8.77e-09 ***
## Fleet_Size 3.807e-01 6.503e-02 5.854 5.07e-09 ***
## Insurance_CompulsoryInsurance -9.802e+02 2.477e+02 -3.958 7.67e-05 ***
## City_Muscat 6.052e+02 1.644e+02 3.680 0.000235 ***
## Body.Condition1 -1.303e+03 6.212e+02 -2.098 0.035972 *
## Body.Condition2 -5.865e+02 3.164e+02 -1.853 0.063863 .
## Body.Condition3 1.607e+00 3.153e+02 0.005 0.995933
## Body.Condition4 2.124e+02 1.915e+02 1.109 0.267601
## Fuel_Electric -6.173e+03 2.174e+03 -2.840 0.004532 **
## Fuel_Gasoline -2.697e+03 9.169e+02 -2.941 0.003281 **
## Fuel_Hybrid -2.717e+03 1.356e+03 -2.004 0.045125 *
## Paint_Totalrepaint -7.283e+02 4.706e+02 -1.547 0.121806
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5905 on 5641 degrees of freedom
## Multiple R-squared: 0.42, Adjusted R-squared: 0.4183
## F-statistic: 240.3 on 17 and 5641 DF, p-value: < 2.2e-16
fw_selected = names(coef(lm.forward.fit)[-1])
lm.train.preds <- (predict(lm.forward.fit, newdata = mytrain))
print(paste("Train MSE", mean((lm.train.preds-mytrain$adjusted_price)^2)))
## [1] "Train MSE 34754382.5593576"
When I look at the above models, the test MSE exceeds the train MSE, which indicates a degree of overfitting that needs to be addressed. I used regularization (ridge regression) to tackle it, running the model with cross-validation to identify the optimal lambda.
I tried lasso regression as well; however, the results were not convincing, so I proceeded with ridge regression.
Optimal lambda: 366.9; Train MSE: 33540464.176; Test MSE: 36978495.368
The ridge results are similar to those of the previous linear regression, so for this data I did not see any improvement in overcoming the overfitting.
set.seed(5555)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:gtsummary':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
Xs_design<-model.matrix(~ ., dplyr::select(mytrain, select=-c(adjusted_price)))[, -1]
Xs_test_design<-model.matrix(~ ., dplyr::select(mytest, select=-c(adjusted_price)))[, -1]
Xs_val_design<-model.matrix(~ ., dplyr::select(myval, select=-c(adjusted_price)))[, -1]
ridgeCV<-cv.glmnet(x=Xs_design, y=mytrain$adjusted_price, family='gaussian', type.measure="mse", alpha=0, nfolds=10)
print(ridgeCV)
##
## Call: cv.glmnet(x = Xs_design, y = mytrain$adjusted_price, type.measure = "mse", nfolds = 10, family = "gaussian", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 375.1 100 35137102 927717 23
## 1se 1661.7 84 36018527 932808 23
ridge_pred<-predict(ridgeCV,s='lambda.min',newx=Xs_design)
print(paste("Train MSE: ", mean((ridge_pred-mytrain$adjusted_price)^2)))
## [1] "Train MSE: 34805364.2697455"
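The lasso attempt mentioned earlier is not shown in the document. A minimal sketch of how it could be run (an assumed reconstruction reusing the `Xs_design`/`Xs_test_design` matrices built above, not the original code) only changes `alpha` from 0 to 1:

```r
# Sketch of the lasso variant (assumption: same setup as the ridge fit above).
# alpha = 1 switches cv.glmnet from the L2 (ridge) to the L1 (lasso) penalty,
# which can shrink some coefficients exactly to zero.
lassoCV <- cv.glmnet(x = Xs_design, y = mytrain$adjusted_price,
                     family = "gaussian", type.measure = "mse",
                     alpha = 1, nfolds = 10)
lasso_pred <- predict(lassoCV, s = "lambda.min", newx = Xs_test_design)
print(paste("Lasso Test MSE:", mean((lasso_pred - mytest$adjusted_price)^2)))
sum(coef(lassoCV, s = "lambda.min") != 0)  # number of nonzero coefficients kept
```

The coefficient count is the practical difference from ridge: lasso can drop variables outright, which is one way to judge whether its fit is "convincing" for this data.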
To check performance with non-parametric models, I implemented a decision tree regressor from the rpart library and tuned it over minsplit and minbucket. The best model for this dataset was found at minsplit = 2 and minbucket = 4. (The call also sets split = "gini", but for a continuous response rpart fits an ANOVA tree, so that setting has no effect here.)
Train MSE: 33940941.674
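The tuning loop behind these values is not shown. A sketch of how minsplit and minbucket could be grid-searched (an assumed reconstruction using the mytrain/myval splits from earlier, not the original run) is:

```r
library(rpart)

# Hypothetical tuning sketch: fit a tree for each (minsplit, minbucket) pair
# on the training data and score it on the validation split.
grid <- expand.grid(minsplit = c(2, 5, 10, 20), minbucket = c(2, 4, 7, 10))
grid$val_mse <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- rpart(adjusted_price ~ ., data = mytrain,
               control = rpart.control(minsplit = grid$minsplit[i],
                                       minbucket = grid$minbucket[i]))
  preds <- predict(fit, newdata = myval)
  grid$val_mse[i] <- mean((preds - myval$adjusted_price)^2)
}

grid[which.min(grid$val_mse), ]  # best-scoring combination
```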
set.seed(5555)
library(rpart)
library(rpart.plot)
dt.allvars.fit <- rpart(adjusted_price ~ ., data = mytrain,
                        parms = list(split = "gini"),
                        control = rpart.control(minsplit = 2, minbucket = 4))
print(summary(dt.allvars.fit))
## Call:
## rpart(formula = adjusted_price ~ ., data = mytrain, parms = list(split = "gini"),
## control = rpart.control(minsplit = 2, minbucket = 4))
## n= 5659
##
## CP nsplit rel error xerror xstd
## 1 0.23473884 0 1.0000000 1.0003109 0.01986570
## 2 0.10199247 1 0.7652612 0.7657066 0.01723029
## 3 0.03818311 2 0.6632687 0.6639561 0.01582855
## 4 0.01427397 3 0.6250856 0.6260962 0.01589487
## 5 0.01190603 4 0.6108116 0.6157780 0.01576720
## 6 0.01133010 6 0.5869995 0.6049484 0.01580221
## 7 0.01000000 7 0.5756694 0.5884939 0.01568743
##
## Variable importance
## Luxury_Car Year
## 49 35
## Fleet_Size Mean_Kilometers
## 7 3
## PaymentMethod_CashorInstallments City_Muscat
## 2 1
## Regional.Specs_AmericanSpecs Regional.Specs_GCCSpecs
## 1 1
## Insurance_CompulsoryInsurance Body.Condition
## 1 1
##
## Node number 1: 5659 observations, complexity param=0.2347388
## mean=16771.64, MSE=5.992371e+07
## left son=2 (4539 obs) right son=3 (1120 obs)
## Primary splits:
## Luxury_Car < 0.5 to the left, improve=0.23473880, (0 missing)
## Year < 2017.5 to the left, improve=0.09513050, (0 missing)
## Fleet_Size < 1152.5 to the right, improve=0.05396075, (0 missing)
## PaymentMethod_CashorInstallments < 0.5 to the left, improve=0.03111901, (0 missing)
## City_Muscat < 0.5 to the left, improve=0.01954907, (0 missing)
## Surrogate splits:
## City_Musandam < 0.5 to the left, agree=0.802, adj=0.002, (0 split)
## Fleet_Size < 2.5 to the right, agree=0.802, adj=0.001, (0 split)
##
## Node number 2: 4539 observations, complexity param=0.1019925
## mean=14908.6, MSE=4.359365e+07
## left son=4 (2388 obs) right son=5 (2151 obs)
## Primary splits:
## Year < 2017.5 to the left, improve=0.17479260, (0 missing)
## PaymentMethod_CashorInstallments < 0.5 to the left, improve=0.02748541, (0 missing)
## Fleet_Size < 391 to the right, improve=0.01900857, (0 missing)
## City_Muscat < 0.5 to the left, improve=0.01493402, (0 missing)
## Mean_Kilometers < 1e+05 to the right, improve=0.01285019, (0 missing)
## Surrogate splits:
## Mean_Kilometers < 90000 to the right, agree=0.583, adj=0.119, (0 split)
## PaymentMethod_CashorInstallments < 0.5 to the left, agree=0.567, adj=0.086, (0 split)
## Fleet_Size < 2425 to the left, agree=0.557, adj=0.065, (0 split)
## City_Muscat < 0.5 to the left, agree=0.556, adj=0.063, (0 split)
## Insurance_CompulsoryInsurance < 0.5 to the right, agree=0.541, adj=0.032, (0 split)
##
## Node number 3: 1120 observations, complexity param=0.03818311
## mean=24321.91, MSE=5.50311e+07
## left son=6 (878 obs) right son=7 (242 obs)
## Primary splits:
## Year < 2018.5 to the left, improve=0.210079400, (0 missing)
## PaymentMethod_CashorInstallments < 0.5 to the left, improve=0.042429710, (0 missing)
## Mean_Kilometers < 80000 to the right, improve=0.025102420, (0 missing)
## City_Muscat < 0.5 to the left, improve=0.011884550, (0 missing)
## Fleet_Size < 86 to the right, improve=0.008370062, (0 missing)
## Surrogate splits:
## Fleet_Size < 17 to the right, agree=0.795, adj=0.050, (0 split)
## Fuel_Electric < 0.5 to the left, agree=0.787, adj=0.017, (0 split)
## Fuel_Gasoline < 0.5 to the right, agree=0.786, adj=0.008, (0 split)
##
## Node number 4: 2388 observations
## mean=12288.75, MSE=2.681049e+07
##
## Node number 5: 2151 observations, complexity param=0.01427397
## mean=17817.11, MSE=4.614674e+07
## left son=10 (1892 obs) right son=11 (259 obs)
## Primary splits:
## Year < 2020.5 to the left, improve=0.04876429, (0 missing)
## Fleet_Size < 282.5 to the right, improve=0.02573351, (0 missing)
## Regional.Specs_GCCSpecs < 0.5 to the left, improve=0.02186721, (0 missing)
## PaymentMethod_CashorInstallments < 0.5 to the left, improve=0.01432732, (0 missing)
## Transmission_Manual < 0.5 to the left, improve=0.01384079, (0 missing)
## Surrogate splits:
## Fleet_Size < 3.5 to the right, agree=0.881, adj=0.008, (0 split)
##
## Node number 6: 878 observations, complexity param=0.0113301
## mean=22536.84, MSE=4.315924e+07
## left son=12 (476 obs) right son=13 (402 obs)
## Primary splits:
## Year < 2016.5 to the left, improve=0.10139200, (0 missing)
## Fleet_Size < 86 to the right, improve=0.03178194, (0 missing)
## PaymentMethod_CashorInstallments < 0.5 to the left, improve=0.01779855, (0 missing)
## Regional.Specs_GCCSpecs < 0.5 to the left, improve=0.01644718, (0 missing)
## City_Muscat < 0.5 to the left, improve=0.01559526, (0 missing)
## Surrogate splits:
## Mean_Kilometers < 80000 to the right, agree=0.567, adj=0.055, (0 split)
## Body.Condition splits as RLRRL, agree=0.565, adj=0.050, (0 split)
## Insurance_NotInsured < 0.5 to the left, agree=0.562, adj=0.042, (0 split)
## Regional.Specs_GCCSpecs < 0.5 to the right, agree=0.550, adj=0.017, (0 split)
## Fleet_Size < 6.5 to the right, agree=0.546, adj=0.007, (0 split)
##
## Node number 7: 242 observations
## mean=30798.34, MSE=4.45984e+07
##
## Node number 10: 1892 observations, complexity param=0.01190603
## mean=17262.09, MSE=4.26622e+07
## left son=20 (1581 obs) right son=21 (311 obs)
## Primary splits:
## Fleet_Size < 282.5 to the right, improve=0.03946604, (0 missing)
## Regional.Specs_GCCSpecs < 0.5 to the left, improve=0.01871216, (0 missing)
## Transmission_Manual < 0.5 to the left, improve=0.01464959, (0 missing)
## Insurance_NotInsured < 0.5 to the right, improve=0.01272611, (0 missing)
## Body.Condition splits as RLLRR, improve=0.01235126, (0 missing)
##
## Node number 11: 259 observations
## mean=21871.56, MSE=5.291248e+07
##
## Node number 12: 476 observations
## mean=20614.42, MSE=4.195353e+07
##
## Node number 13: 402 observations
## mean=24813.14, MSE=3.502935e+07
##
## Node number 20: 1581 observations
## mean=16686.58, MSE=3.518183e+07
##
## Node number 21: 311 observations, complexity param=0.01190603
## mean=20187.71, MSE=7.044636e+07
## left son=42 (147 obs) right son=43 (164 obs)
## Primary splits:
## Fleet_Size < 174 to the left, improve=0.22316530, (0 missing)
## PaymentMethod_CashorInstallments < 0.5 to the left, improve=0.04106879, (0 missing)
## Insurance_NotInsured < 0.5 to the right, improve=0.02260820, (0 missing)
## Mean_Kilometers < 110000 to the right, improve=0.01957177, (0 missing)
## Regional.Specs_AmericanSpecs < 0.5 to the left, improve=0.01245478, (0 missing)
## Surrogate splits:
## Regional.Specs_AmericanSpecs < 0.5 to the left, agree=0.711, adj=0.388, (0 split)
## Regional.Specs_GCCSpecs < 0.5 to the right, agree=0.698, adj=0.361, (0 split)
## Year < 2018.5 to the right, agree=0.605, adj=0.163, (0 split)
## Body.Condition splits as RLRRL, agree=0.601, adj=0.156, (0 split)
## Transmission_Manual < 0.5 to the right, agree=0.556, adj=0.061, (0 split)
##
## Node number 42: 147 observations
## mean=15999.72, MSE=4.055163e+07
##
## Node number 43: 164 observations
## mean=23941.59, MSE=6.742951e+07
##
## n= 5659
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 5659 339108300000 16771.64
## 2) Luxury_Car< 0.5 4539 197871600000 14908.60
## 4) Year< 2017.5 2388 64023440000 12288.75 *
## 5) Year>=2017.5 2151 99261630000 17817.11
## 10) Year< 2020.5 1892 80716880000 17262.09
## 20) Fleet_Size>=282.5 1581 55622480000 16686.58 *
## 21) Fleet_Size< 282.5 311 21908820000 20187.71
## 42) Fleet_Size< 174 147 5961090000 15999.72 *
## 43) Fleet_Size>=174 164 11058440000 23941.59 *
## 11) Year>=2020.5 259 13704330000 21871.56 *
## 3) Luxury_Car>=0.5 1120 61634830000 24321.91
## 6) Year< 2018.5 878 37893810000 22536.84
## 12) Year< 2016.5 476 19969880000 20614.42 *
## 13) Year>=2016.5 402 14081800000 24813.14 *
## 7) Year>=2018.5 242 10792810000 30798.34 *
rpart.plot(dt.allvars.fit, box.palette = "Blues", shadow.col = "gray", nn = TRUE, extra = 'auto')
dt_pred_train <- predict(dt.allvars.fit, newdata = mytrain)
print(paste("Train MSE: ",mean((dt_pred_train-mytrain$adjusted_price)^2)))
## [1] "Train MSE: 34496249.6910938"
To see how an ANN model responds to this data, I performed ANN modeling with the neuralnet package. I first scaled the train, test, and validation data using caret's preProcess().
library(caret)
## Loading required package: lattice
head(mytrain)
## Year adjusted_price Mean_Kilometers Regional.Specs_AmericanSpecs
## 9859 2016 15850 55000 0
## 1803 2016 11308 85000 1
## 4828 2017 9671 95000 1
## 4876 2015 8767 0 1
## 7488 2019 35033 55000 0
## 3599 2020 10811 55000 1
## Regional.Specs_JapaneseSpecs Regional.Specs_GCCSpecs Transmission_Manual
## 9859 0 0 0
## 1803 0 0 0
## 4828 0 0 0
## 4876 0 0 0
## 7488 0 1 0
## 3599 0 0 0
## Fuel_Gasoline Fuel_Hybrid Fuel_Electric Paint_Totalrepaint Body.Condition
## 9859 1 0 0 0 0
## 1803 1 0 0 0 0
## 4828 1 0 0 0 4
## 4876 1 0 0 0 4
## 7488 1 0 0 0 4
## 3599 1 0 0 0 2
## Luxury_Car Fleet_Size PaymentMethod_CashorInstallments
## 9859 1 844 1
## 1803 0 496 0
## 4828 0 1461 1
## 4876 0 3389 0
## 7488 0 49 0
## 3599 0 286 0
## PaymentMethod_InstallmentsOnly City_Musandam City_Muscat City_AlWustaa
## 9859 0 0 0 0
## 1803 0 0 0 0
## 4828 0 0 0 0
## 4876 0 0 1 0
## 7488 0 0 1 0
## 3599 0 0 1 0
## Insurance_CompulsoryInsurance Insurance_NotInsured
## 9859 0 1
## 1803 1 0
## 4828 1 0
## 4876 1 0
## 7488 0 0
## 3599 1 0
scaling_mytrain <- preProcess(dplyr::select(mytrain, -c('adjusted_price')), method = c("center", "scale"))
scaled_mytrain <- predict(scaling_mytrain,
newdata = dplyr::select(mytrain, -c("adjusted_price")))
scaled_mytrain <- cbind(scaled_mytrain, dplyr::select(mytrain, c("adjusted_price")))
scaling_mytest <- preProcess(dplyr::select(mytest, -c("adjusted_price")), method = c("center", "scale"))
## Warning in preProcess.default(dplyr::select(mytest, -c("adjusted_price")), :
## These variables have zero variances: City_Musandam, City_AlWustaa
scaled_mytest <- predict(scaling_mytest,
newdata = dplyr::select(mytest, -c("adjusted_price")))
scaled_mytest <- cbind(scaled_mytest, dplyr::select(mytest, c("adjusted_price")))
scaling_myval <- preProcess(dplyr::select(myval, -c("adjusted_price")), method = c("center", "scale"))
## Warning in preProcess.default(dplyr::select(myval, -c("adjusted_price")), :
## These variables have zero variances: City_Musandam, City_AlWustaa
scaled_myval <- predict(scaling_myval,
newdata = dplyr::select(myval, -c("adjusted_price")))
scaled_myval <- cbind(scaled_myval, dplyr::select(myval, c("adjusted_price")))
dim(scaled_mytrain)
## [1] 5659 21
head(scaled_mytrain)
## Year Mean_Kilometers Regional.Specs_AmericanSpecs
## 9859 -0.8156119 -0.09177959 -1.3920745
## 1803 -0.8156119 0.42380297 0.7182254
## 4828 -0.2617240 0.59566382 0.7182254
## 4876 -1.3694997 -1.03701427 0.7182254
## 7488 0.8460517 -0.09177959 -1.3920745
## 3599 1.3999396 -0.09177959 0.7182254
## Regional.Specs_JapaneseSpecs Regional.Specs_GCCSpecs Transmission_Manual
## 9859 -0.1420828 -0.5088307 -0.1888964
## 1803 -0.1420828 -0.5088307 -0.1888964
## 4828 -0.1420828 -0.5088307 -0.1888964
## 4876 -0.1420828 -0.5088307 -0.1888964
## 7488 -0.1420828 1.9649431 -0.1888964
## 3599 -0.1420828 -0.5088307 -0.1888964
## Fuel_Gasoline Fuel_Hybrid Fuel_Electric Paint_Totalrepaint Body.Condition
## 9859 0.1263947 -0.07888111 -0.03990788 -0.1732843 0
## 1803 0.1263947 -0.07888111 -0.03990788 -0.1732843 0
## 4828 0.1263947 -0.07888111 -0.03990788 -0.1732843 4
## 4876 0.1263947 -0.07888111 -0.03990788 -0.1732843 4
## 7488 0.1263947 -0.07888111 -0.03990788 -0.1732843 4
## 3599 0.1263947 -0.07888111 -0.03990788 -0.1732843 2
## Luxury_Car Fleet_Size PaymentMethod_CashorInstallments
## 9859 2.0129487 -0.5818923 1.1530526
## 1803 -0.4966959 -0.8384764 -0.8671099
## 4828 -0.4966959 -0.1269716 1.1530526
## 4876 -0.4966959 1.2945635 -0.8671099
## 7488 -0.4966959 -1.1680543 -0.8671099
## 3599 -0.4966959 -0.9933117 -0.8671099
## PaymentMethod_InstallmentsOnly City_Musandam City_Muscat City_AlWustaa
## 9859 -0.06246676 -0.01880111 -0.8949528 -0.0230286
## 1803 -0.06246676 -0.01880111 -0.8949528 -0.0230286
## 4828 -0.06246676 -0.01880111 -0.8949528 -0.0230286
## 4876 -0.06246676 -0.01880111 1.1171799 -0.0230286
## 7488 -0.06246676 -0.01880111 1.1171799 -0.0230286
## 3599 -0.06246676 -0.01880111 1.1171799 -0.0230286
## Insurance_CompulsoryInsurance Insurance_NotInsured adjusted_price
## 9859 -2.0253568 3.6702541 15850
## 1803 0.4936529 -0.2724126 11308
## 4828 0.4936529 -0.2724126 9671
## 4876 0.4936529 -0.2724126 8767
## 7488 -2.0253568 -0.2724126 35033
## 3599 0.4936529 -0.2724126 10811
I built a feed-forward neural network with 3 hidden layers of 7, 5, and 3 nodes respectively, using a learning rate and threshold of 0.01 and the logistic activation function. I also tried sigmoid and softmax activation functions, but could not get those models to run.
The network graph is shown below.
if (!require("neuralnet")) install.packages("neuralnet")
## Loading required package: neuralnet
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(neuralnet)
# softplus<-function(x) {log(1+exp(x))}
sigmoid<-function(x) {1/(1+exp(-x))}
set.seed(5555)
scaled_mytrain_design<-model.matrix(~ ., data = scaled_mytrain)[, -1]
scaled_mytest_design<-model.matrix(~ ., data = scaled_mytest)[, -1]
scaled_myval_design<-model.matrix(~ ., data = scaled_myval)[, -1]
# Xs_test_design<-model.matrix(~ ., dplyr::select(mytest, select=-c(adjusted_price)))[, -1]
train_ann<-neuralnet(adjusted_price~., data=scaled_mytrain_design, hidden=c(7,5,3),rep=3,linear.output=TRUE,algorithm='rprop+',learningrate=.01, act.fct='logistic', stepmax=1e+06, lifesign='minimal',threshold=0.01)
## hidden: 7, 5, 3 thresh: 0.01 rep: 1/3 steps:
## 41964 error: 169554140790.735 time: 2.35 mins
## hidden: 7, 5, 3 thresh: 0.01 rep: 2/3 steps: 41980 error: 169554140790.735 time: 2.31 mins
## hidden: 7, 5, 3 thresh: 0.01 rep: 3/3 steps: 41946 error: 169554140790.736 time: 2.31 mins
plot(train_ann, rep = "best", show.weights = TRUE, fontsize = 10)
ann.train.preds<-predict(train_ann, scaled_mytrain_design)
print(paste("Train MSE", mean((ann.train.preds-scaled_mytrain$adjusted_price)^2)))
## [1] "Train MSE 59923711.1824475"
KNN
Finally, I also developed a KNN model. I used 5-fold cross-validation to find the optimal k for this data and built the KNN model with it. I also tried LOOCV, but it ran for a prohibitively long time.
The results of the model are below.
Optimal k = 5; Train MSE: 25927190.361. This is the lowest training MSE among the models so far. In the next section I use the test data to evaluate the model candidates.
library(caret)
# Define the training control
train_control <- trainControl(method = "cv", number = 5)
# Train the KNN model
knn_model <- train(adjusted_price ~ ., data = mytrain, method = "knn", trControl = train_control, tuneLength = 10)
# Print the model
print((knn_model))
## k-Nearest Neighbors
##
## 5659 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4528, 4527, 4527, 4527, 4527
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 6164.678 0.3778709 4449.098
## 7 6220.034 0.3624611 4522.365
## 9 6265.807 0.3501957 4587.937
## 11 6348.779 0.3310960 4675.484
## 13 6428.428 0.3135109 4759.746
## 15 6487.356 0.3003560 4832.093
## 17 6549.820 0.2862534 4908.764
## 19 6614.630 0.2716473 4981.768
## 21 6665.955 0.2601872 5042.176
## 23 6699.103 0.2525524 5086.271
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Make predictions
predictions <- predict(knn_model, newdata = mytrain)
mse <- mean((predictions - mytrain$adjusted_price)^2)
print(paste("Train MSE", mse))
## [1] "Train MSE 26070632.7047127"
I again begin with linear regression, and I have evaluated the model using residual plots as well.
From the residuals vs. fitted values plot, I can see that the variance is not constant and the mean of the residuals is not zero. The Q-Q plot also departs from the reference line, which means the model violates the linear regression assumptions.
I tried transformations of explanatory variables such as Year, Mean_Kilometers, and Fleet_Size (log, square, and square-root transformations), but they did not give much improvement in the model.
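The transformation experiments themselves are not reproduced here. A minimal sketch of how such variants could be compared (the formulas are assumptions for illustration, reusing the mytrain/mytest splits from earlier) is:

```r
# Hypothetical sketch of the transformation comparison (not the original code):
# fit the linear model with transformed versions of the skewed numeric
# predictors and compare test MSE. log1p() is used where zeros occur
# (e.g. Mean_Kilometers contains 0 values, so plain log() would fail).
candidates <- list(
  raw    = adjusted_price ~ .,
  logged = adjusted_price ~ . - Year - Fleet_Size + log(Year) + log1p(Fleet_Size),
  rooted = adjusted_price ~ . - Mean_Kilometers + sqrt(Mean_Kilometers)
)

test_mse <- sapply(candidates, function(f) {
  fit <- lm(f, data = mytrain)
  mean((predict(fit, newdata = mytest) - mytest$adjusted_price)^2)
})
sort(test_mse)  # lower is better; in this analysis the gains were marginal
```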
Test MSE - 36871038.216
lm.test.preds <- (predict(lm.allvars.fit1, newdata = mytest))
lm.fw.test.MSE <- mean((lm.test.preds-mytest$adjusted_price)^2)
print(paste("TEST MSE", lm.fw.test.MSE))
## [1] "TEST MSE 34553256.3125353"
# lm.test.preds <- (predict(lm.allvars.fit, newdata = myval))
# print(paste("TEST MSE", mean((lm.test.preds-myval$adjusted_price)^2)))
autoplot(lm.allvars.fit1, which=1, nrow=1, ncol= 1)
autoplot(lm.allvars.fit1, which=2, nrow=1, ncol= 1)
autoplot(lm.allvars.fit1, which=3, nrow=1, ncol= 1)
Next are the forward stepwise results on the test data, with residual plots. Again, the residual analysis does not align with the linear regression assumptions.
Test MSE: 36907041.534
lm.test.preds <- (predict(lm.forward.fit, newdata = mytest))
print(paste("TEST MSE: ", mean((lm.test.preds-mytest$adjusted_price)^2)))
## [1] "TEST MSE: 34616957.2908361"
# lm.val.preds <- (predict(lm.forward.fit, newdata = myval))
# print(paste("Val MSE: ", mean((lm.train.preds-myval$adjusted_price)^2)))
autoplot(lm.forward.fit, which=1, nrow=1, ncol= 1)
autoplot(lm.forward.fit, which=2, nrow=1, ncol= 1)
autoplot(lm.forward.fit, which=3, nrow=1, ncol= 1)
Results from Ridge
Test MSE: 36978495.368
ridge_pred<-predict(ridgeCV,s='lambda.min',newx=Xs_test_design)
print(paste("Test MSE: " ,mean((ridge_pred-mytest$adjusted_price)^2)))
## [1] "Test MSE: 34681433.4049551"
# ridge_pred<-predict(ridgeCV,s='lambda.min',newx=Xs_val_design)
# print(paste("Test MSE: " ,mean((ridge_pred-myval$adjusted_price)^2)))
Results from the decision tree regressor:
Test MSE: 36318409.0511821
The decision tree's test MSE is higher than its training MSE, indicating some remaining overfitting on this data.
dt_pred_test<-predict(dt.allvars.fit,newdata=mytest)
print (paste("TEST MSE: ", mean((dt_pred_test-mytest$adjusted_price)^2)))
## [1] "TEST MSE: 35239243.557078"
#
# dt_pred_val<-predict(dt.allvars.fit, newdata = myval)
# print(paste("VAL MSE: ", mean((dt_pred_val-myval$adjusted_price)^2)))
Results from ANN model
Test MSE - 61934887.647
ann.test.preds <-predict(train_ann, scaled_mytest_design)
print(paste("TEST MSE", mean((ann.test.preds-scaled_mytest$adjusted_price)^2)))
## [1] "TEST MSE 60603937.9901676"
Results from KNN
Test MSE - 39351310.513808
predictions <- predict(knn_model, newdata = mytest)
mse <- mean((predictions - mytest$adjusted_price)^2)
print(paste("Test MSE", mse))
## [1] "Test MSE 39197011.2271018"
# predictions <- predict(knn_model, newdata = myval)
# mse <- mean((predictions - myval$adjusted_price)^2)
# print(paste("val MSE", mse))
From the above analysis, KNN performs best on the training data and gives the lowest training MSE, but on the test data its MSE is comparatively higher than the other models'.
The lowest test MSE was found for the decision tree model with all variables, at 36318409.051. This is a little higher than the linear models', but the linear models fail to satisfy their assumptions.
To conclude, based on the train and test MSEs, I selected the decision tree as the final model.
To better understand the results, I again compared the linear, decision tree, and KNN models on the validation data. The decision tree gave the best results overall, which means it predicts car prices more reliably on new data points.
# ## finalized model -- KNN at K = 5 I got least Rsquared value as 0.362
lm.val.preds <- (predict(lm.allvars.fit, newdata = myval))
print(paste("Validation MSE", mean((lm.val.preds-myval$adjusted_price)^2)))
## [1] "Validation MSE 45149597.2067025"
predictions <- predict(knn_model, newdata = myval)
mse <- mean((predictions - myval$adjusted_price)^2)
print(paste("val MSE", mse))
## [1] "val MSE 39804156.4482956"
dt_pred_val<-predict(dt.allvars.fit,newdata=myval)
print (paste("Validation MSE: ", mean((dt_pred_val-myval$adjusted_price)^2)))
## [1] "Validation MSE:  36159834.9239597"
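To make the comparison easier to read, the evaluation MSEs printed above can be collected into one table (values transcribed from the printed outputs; NA where a model was not re-run on the validation split):

```r
# Side-by-side summary of the evaluation results reported above.
results <- data.frame(
  model    = c("Linear (all vars)", "Forward stepwise", "Ridge",
               "Decision tree", "ANN", "KNN (k = 5)"),
  test_mse = c(34553256, 34616957, 34681433, 35239244, 60603938, 39197011),
  val_mse  = c(45149597, NA, NA, 36159835, NA, 39804156)
)
results[order(results$test_mse), ]
```

On the validation column the decision tree has the lowest MSE of the three models compared there, which supports choosing it as the final model.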
I filtered the dataset by year, dropping rows with manufacturing year before 2015, converted the categorical variables to dummy variables for modeling, and created new features such as Luxury_Car, Fleet_Size, and Mean_Kilometers.
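The dummy-variable encoding itself was done in an earlier section; as a reminder of the idea, a minimal self-contained sketch using base R's model.matrix() (the tiny Transmission column here is illustrative, not the real data) is:

```r
# Illustrative sketch: reference-coded dummy variables with model.matrix().
# With k levels, k - 1 indicator columns are produced; the first level
# ("Automatic" here) becomes the reference/baseline category.
raw <- data.frame(Transmission = factor(c("Automatic", "Manual", "Automatic")))
dummies <- model.matrix(~ Transmission, data = raw)[, -1, drop = FALSE]
colnames(dummies)                # "TransmissionManual"
dummies[, "TransmissionManual"]  # 0 for Automatic rows, 1 for Manual
```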
From the modeling and analysis of the candidate models, I compared their performance using mean squared error on different data subsets, which lets me choose the model with the best performance characteristics as tested on those subsets.
From this analysis, the decision tree outperformed the other models. One possible explanation is that the data contains many categorical variables, which decision trees handle well.
The best MSPE achieved is 36318409.051, using the decision tree regressor.
Nonetheless, this analysis has limitations, and there is room to improve it. I could investigate data patterns on the complete dataset, apply further data-exploration techniques, and tune model hyperparameters: variable transformations for the linear models, more nodes and hidden layers for the ANN, parameters such as minsplit for the decision tree, and so on. These could be pursued to see whether an even better model can be extracted.